Introduction

We will be working with data that are sets of ordered pairs of numbers. The ordered pairs are labeled \((x_i,y_i)\) for \(i = 1\) to \(n\). The \(y\)’s are called the response (dependent) variable and the \(x\)’s are called the control (independent) variable.

Simple Linear Regression is widely used to model the possible relationship between \(x\) and \(y\), as long as the following assumptions are satisfied.

Assumptions for Simple Linear Regression Model:

  1. Each \(y_i\) is related to \(x_i\) via the equation \(y_i=\beta_0+\beta_1x_i+\epsilon_i\);
  2. The \(x_i\)’s are given: i.e., the \(x_i\)’s are not random variables;
  3. The \(\epsilon_i\)’s are independent normal random variables with mean 0 and standard deviation \(\sigma\);
  4. \(\beta_0, \beta_1\), and \(\sigma\) are unknown population parameters to be estimated from the sample data.

When using simple linear regression, there are two major tasks.

I. Determine whether or not it is reasonable to assume our data were generated by the simple linear regression model.

II. If we answer yes to I., then we need to estimate the numbers \(\beta_0, \beta_1\), and \(\sigma\) and use our estimates to answer questions, make decisions, etc.

The estimates for \(\beta_0, \beta_1\), and \(\sigma\) are usually called \(\widehat{\beta_0}, \widehat{\beta_1}\), and \(\widehat{\sigma}\). Once we have found these estimates, then given a value of \(x\), we can use them to estimate the mean value of the response variable \(y\). This estimate is called \(\widehat{y}\) and is calculated as follows:

\[ \widehat{y}=\widehat{\beta_0}+\widehat{\beta_1}x. \]

The residuals \(\widehat{e_i}\) are given by the following equation

\[ \widehat{e_i}=y_i-\widehat{y_i}=y_i-(\widehat{\beta_0}+\widehat{\beta_1}x_i) \]

and are central in analyzing the simple linear regression model. Analyzing the residuals can provide further evidence that the assumptions for simple linear regression have or have not been met. In addition, if the assumptions for the simple linear regression model are satisfied, the residuals carry information about how reliable our estimates might be.
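In R, once a model has been fitted with lm() (as we will do shortly), the estimates, fitted values, and residuals are all directly available. A minimal preview sketch, using made-up example data rather than anything from the lab:

```r
# preview sketch: after fitting with lm(), the estimates, fitted values,
# and residuals come from coef(), fitted(), and resid()
x <- seq(0, 2, by = 0.1)            # example control variable
y <- 2 + 4 * x + rnorm(length(x))   # example response
fit <- lm(y ~ x)
coef(fit)     # the estimates of beta_0 (intercept) and beta_1 (slope)
fitted(fit)   # the fitted values, y-hat_i
resid(fit)    # the residuals, e-hat_i = y_i - y-hat_i
```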

A straightforward way to begin checking the assumptions for simple linear regression is to make a scatter plot of the ordered pairs and display the line \(\widehat{y}=\widehat{\beta_0}+\widehat{\beta_1}x\) on the same graph. If the data points scatter non-randomly about the line \(\widehat{y}=\widehat{\beta_0}+\widehat{\beta_1}x\), then it’s likely that one or more of the assumptions for simple linear regression are not satisfied and something else should be done. If the scatter plot looks reasonable, then we move forward with a more thorough analysis of the ordered pairs.

In today’s lab, we will first work with simulated data to understand the elementary processes of simple linear regression, then we will work with real-world data, and finally we will consider some special examples about regression.

Part I: Working with Simulated Data

In Part I, we first pre-set the relationship between the \(x\)’s and \(y\)’s (step 0), then simulate sample pairs \((x_i,y_i)\) based on that pre-set relationship (step 1). After the sample pairs are obtained, we “pretend” that we don’t know that relationship and estimate it from the sample data (step 2).

Step 0: We first define that the true relationship between \(x\) and \(y\) is given by

\[ y_i=2+4x_i+\epsilon_i \]

Step 1.1: Define \(x_i\): recall assumption 2, the \(x_i\)’s should be given, not random. Here we define the \(x_i\)’s as the numbers from 0 through 2 in steps of 0.1. To do that, we use the seq(from, to, by) function. These are the \(x\)’s for Part I; we use the object x1 to store the results.

x1 <- seq(from=0,to=2,by=0.1)
x1
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8
## [20] 1.9 2.0

Step 1.2: Define \(\epsilon_i\): recall assumption 3, the \(\epsilon_i\)’s should be independent normal random variables with mean 0 and fixed standard deviation (here we set \(\sigma=1\)). To simulate such random variables, we use the function rnorm(number of random variables to generate, mean, standard deviation):

eps1 <- rnorm(21,0,1)
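Because rnorm() is random, your simulated values (and all output shown below) will differ from run to run. If you want reproducible draws, you can set the random seed first; the seed value 155 here is an arbitrary choice:

```r
set.seed(155)            # arbitrary seed; any fixed number makes the draws reproducible
eps1 <- rnorm(21, 0, 1)
```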

Step 1.3: Simulate \(y_i\): this is our first model in Part I; we store the results in the object y1_1

y1_1 <- 2+4*x1+eps1

Up to this point, we have obtained the 21 sample pairs \((x_i,y_i)\). Now, let’s forget the true relationship and try to recover it. If we assume that the relationship is linear, we use simple linear regression analysis to do that.

Step 2.1: Obtain the scatter plot of \((x_i,y_i)\)

plot(y1_1~x1)
title(main="Scatter plot - Model 1 - Your Name")

Does the above scatter plot show an approximately linear pattern? If yes, we can go to the next sub-step.

Step 2.2: Simple Linear Regression Analysis

mod1 <- lm(y1_1~x1)
summary(mod1)
## 
## Call:
## lm(formula = y1_1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.51059 -0.59659  0.08304  0.41501  1.74562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6999     0.3850   7.013 1.12e-06 ***
## x1            3.4952     0.3293  10.614 2.00e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9138 on 19 degrees of freedom
## Multiple R-squared:  0.8557, Adjusted R-squared:  0.8481 
## F-statistic: 112.7 on 1 and 19 DF,  p-value: 1.999e-09
plot(mod1)

How do we interpret the analysis results?
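Beyond reading summary(mod1) by eye, the individual pieces of the output can be pulled out programmatically; a sketch (the numbers will vary with the simulated eps1):

```r
coef(mod1)                   # the estimates beta_0-hat and beta_1-hat
confint(mod1, level = 0.95)  # 95% confidence intervals for beta_0 and beta_1
summary(mod1)$sigma          # sigma-hat, the residual standard error
summary(mod1)$r.squared      # fraction of the variation in y explained by x
```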

Finally, let’s display the scatter plot together with the regression line

plot(y1_1~x1)
abline(mod1)

What is your observation?

Assignment 1.1

Now mimic the above processes: use the same x1 and eps1, and define the second y using y1_2<-3-2*x1+eps1. Then obtain the scatter plot and check whether the relationship between y1_2 and x1 is approximately linear. If yes, run the simple linear regression analysis (store the result in the object mod2), interpret the results, and finally plot the scatter plot together with the regression line.

Remark: When you create the plots, make sure you add the title with your name.

R codes:

y1_2<-3-2*x1+eps1
plot(y1_2~x1)
title(main="Scatter plot - Model 2 - James")

mod2 <- lm(y1_2~x1)
summary(mod2)
## 
## Call:
## lm(formula = y1_2 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.51059 -0.59659  0.08304  0.41501  1.74562 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.6999     0.3850   9.611 9.94e-09 ***
## x1           -2.5048     0.3293  -7.606 3.52e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9138 on 19 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7398 
## F-statistic: 57.85 on 1 and 19 DF,  p-value: 3.518e-07
plot(mod2)

plot(y1_2~x1)
abline(mod2)

Interpretation:

There are 3 apparent outliers, and the line of best fit doesn’t come near most of the points beyond x = 0.10.

End of Assignment 1.1

Next, we will explore what happens when assumption 1 or assumption 3 for the linear regression analysis is not satisfied.

Assumption 1 says that the relationship between \(x\) and \(y\) is linear. So we define our model 3 as

\[ y_i=2+4x_i^3+\epsilon_i \]

where \(\epsilon_i\) still satisfies assumption 3, independent normal random variables with mean 0 and standard deviation 1.

Assignment 1.2

Now mimic the above processes: use the same x1 and eps1, and define the third y using y1_3<-2+4*x1^3+eps1. Then obtain the scatter plot and check whether the relationship between y1_3 and x1 is approximately linear. Still run the simple linear regression analysis (store the result in the object mod3), but check how reliable the result is, and finally plot the scatter plot together with the regression line.

Remark: When you create the plots, make sure you add the title with your name.

R codes:

y1_3<-2+4*x1^3+eps1
plot(y1_3~x1)
title(main="Scatter plot - Model 3 - James")

mod3 <- lm(y1_3~x1)
summary(mod3)
## 
## Call:
## lm(formula = y1_3 ~ x1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.396 -2.972 -0.501  2.346  9.051 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.532      1.651   -2.14   0.0456 *  
## x1            14.127      1.412   10.01 5.22e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.918 on 19 degrees of freedom
## Multiple R-squared:  0.8405, Adjusted R-squared:  0.8321 
## F-statistic: 100.1 on 1 and 19 DF,  p-value: 5.22e-09
plot(mod3)

plot(y1_3~x1)
abline(mod3)

Interpretation:

This model is a lot less scattered than model 2, but the points curve upward rather than following a straight line, so the relationship between y1_3 and x1 is not really linear and the straight-line fit misses the pattern at both ends.

End of Assignment 1.2

What if assumption 3 is not satisfied?

Assumption 3 says that the error terms \(\epsilon_i\) should be independent normal random variables with mean 0 and fixed standard deviation, so we define our model 4 as

\[ y_i=2+4x_i+\epsilon_i^* \]

where \(\epsilon_i^*\) follows a skewed distribution but still with mean 0.

To do that, use

eps2 <- rlnorm(21,0,1)-exp(1/2)
hist(eps2)

mean(eps2)
## [1] -0.2236425
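The subtraction of exp(1/2) works because a lognormal(0, 1) random variable has theoretical mean \(e^{1/2}\approx 1.649\), so shifting by that constant centers the errors at 0 in expectation; the sample mean of only 21 skewed draws can still sit noticeably away from 0, as above. A quick check with a much larger sample:

```r
# with many draws, the sample mean of the shifted lognormal is near 0
big <- rlnorm(1e6, 0, 1) - exp(1/2)
mean(big)   # close to 0
```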

Assignment 1.3

Now mimic the above processes: use the same x1 and the new eps2, and define the fourth y using y1_4<-2+4*x1+eps2. Then obtain the scatter plot and check whether the relationship between y1_4 and x1 is approximately linear. Still run the simple linear regression analysis (store the result in the object mod4), but check how reliable the result is, and finally plot the scatter plot together with the regression line.

Remark: When you create the plots, make sure you add the title with your name.

R codes:

y1_4<-2+4*x1+eps2
plot(y1_4~x1)
title(main="Scatter plot - Model 4 - James")

mod4 <- lm(y1_4~x1)
summary(mod4)
## 
## Call:
## lm(formula = y1_4 ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6280 -0.8322 -0.4475 -0.2436  4.2811 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2091     0.6257   3.531  0.00223 ** 
## x1            3.5673     0.5352   6.665 2.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.485 on 19 degrees of freedom
## Multiple R-squared:  0.7004, Adjusted R-squared:  0.6847 
## F-statistic: 44.43 on 1 and 19 DF,  p-value: 2.25e-06
plot(mod4)

plot(y1_4~x1)
abline(mod4)

Interpretation:

There seems to be only one clear outlier on the scatter plot, and the relationship still looks quite linear; however, the skewed errors violate assumption 3, so the standard errors and p-values from the regression are less trustworthy.

End of Assignment 1.3

Comment: In practice, when a non-linear or non-normal pattern is observed, we would not do a simple linear regression analysis on that data set. Instead, we would probably try to transform or preprocess the pairs in some way to make them look more like a straight line and then do a simple linear regression, or we might try a non-linear regression.
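For instance, for the cubic data of model 3, one possible transformation is to regress y1_3 on \(x^3\) instead of \(x\); this is just a sketch of the idea (reusing x1 and y1_3 from above), not a prescribed fix:

```r
# regress y1_3 on the transformed control variable x1^3;
# I() makes lm() treat ^ as arithmetic rather than formula syntax
mod3b <- lm(y1_3 ~ I(x1^3))
summary(mod3b)   # this fit should track the curvature much better
```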

Part II: Example with Real-World Data

Read the data file “TestScore.csv”. This data set contains MATH 155 midterm, homework, final exam, and course average scores. Store the data in an object called TestScore.

R code for reading the data:

TestScore <- read.csv("TestScore.csv")

This is combined (deidentified) test, homework, final exam, and course average data from two sections of MATH 155 taught in Fall 2017.

Assignment 2

  1. Create scatter plots of all variables against course average (test 1 vs. course average, test 2 vs. course average, etc., a total of 6 graphs). Which explanatory variable is the strongest in predicting course average, in your opinion? How do you decide?

    R codes:

    TestScore
    ##      T1   T2   T3 Tave  HW FinalExam CourseAverage
    ## 1  80.7 87.5 79.4 82.5 100        76          85.0
    ## 2  68.6 87.0 79.4 78.3  69        66          77.7
    ## 3  82.9 88.0 84.7 85.2  98        46          75.6
    ## 4  52.1 83.0 59.4 64.9  59        56          64.9
    ## 5   7.9 23.5  0.0 10.5  29         0          12.7
    ## 6  47.1 50.5  0.0 32.5  90        34          39.7
    ## 7  77.1 87.5 81.2 81.9  95        80          82.2
    ## 8  92.1 99.5 92.9 94.9  99        82          91.5
    ## 9  57.9 73.5 70.6 67.3  81        46          64.0
    ## 10 86.4 78.0 73.5 79.3  41        60          70.4
    ## 11 79.3 61.0 74.1 71.5  83        52          69.8
    ## 12 74.3 86.0 66.5 75.6  90        64          75.9
    ## 13 85.0 81.0 82.4 82.8  78        72          81.8
    ## 14 67.1 25.0 58.2 50.1  28        44          47.9
    ## 15 81.4 91.5 91.8 88.2  89        80          86.3
    ## 16 82.1 78.5 83.5 81.4  70        64          73.3
    ## 17 54.3 79.0  0.0 44.4  46         0          38.5
    ## 18 90.7 87.5 98.8 92.3  44        82          84.4
    ## 19 48.6 58.5 44.7 50.6  51        46          54.3
    ## 20 69.3 86.0 62.4 72.5  90        74          76.0
    ## 21 80.7 73.0 62.4 72.0  62        52          65.7
    ## 22 59.3 92.0 86.5 79.3 100        88          86.4
    ## 23 86.4 95.0 95.9 92.4  31        84          82.5
    ## 24 94.3 75.0 86.0 85.1  66        86          83.6
    ## 25 65.0 84.5 85.9 78.5  94        54          74.2
    ## 26 92.9 90.0 95.9 92.9  96        86          91.4
    ## 27  0.0 65.0  0.0 21.7   8         0          13.5
    ## 28 35.0 58.0 62.4 51.8  66        44          54.8
    ## 29 68.6 47.0 33.5 49.7   6        42          46.6
    ## 30 59.3 61.5 64.1 61.6  85        50          61.1
    ## 31 52.9 39.5 21.2 37.8  78        36          42.7
    ## 32 67.9 91.0 74.7 77.9  84        56          73.2
    ## 33 84.3 97.5 88.2 90.0  98        86          91.1
    ## 34 92.1 64.0 93.5 83.2  91        64          80.1
    ## 35 61.4 77.5 75.3 71.4  55        82          71.0
    ## 36 86.4 81.0 80.6 82.7  90        72          82.0
    ## 37 56.4 69.0  0.0 41.8  47         0          33.5
    ## 38 92.1 77.0 52.0 73.7  83        52          70.3
    ## 39 70.7 83.0 47.6 67.1  99        48          67.9
    ## 40 75.0 83.0 64.1 74.0  58        54          64.8
    ## 41 65.0 57.0 46.5 56.2  85        74          65.9
    ## 42 88.6 84.0 79.4 84.0  94        80          84.3
    ## 43 87.9 98.5 85.3 90.6  99        80          88.4
    ## 44 81.4 91.0 66.5 79.6 100        50          75.4
    ## 45 81.4 99.0 80.6 87.0  93        68          80.6
    ## 46 63.6 54.5 52.4 56.8  97        52          63.5
    ## 47 77.1 88.5 92.4 86.0  94        80          86.7
    ## 48 88.6 81.0 91.8 87.1  97        56          80.7
    ## 49 94.3 88.5 90.0 90.9 100        82          89.8
    ## 50 51.4 61.0  0.0 37.5  47         0          32.0
    ## 51 62.9  0.0  0.0 21.0  40         0          23.1
    ## 52  0.0  0.0  0.0  0.0  30         0          11.0
    ## 53 87.9 99.5 97.1 94.8 100        82          91.3
    ## 54 52.9 53.5  0.0 35.5  50         0          31.5
    ## 55 83.6 57.5 67.1 69.4  78        78          73.6
    ## 56 92.1 88.5 95.9 92.2 100        86          91.7
    ## 57 81.4 63.0 64.7 69.7  92        68          74.9
    ## 58 90.0 99.5 91.8 93.8  98        84          92.7
    ## 59 73.6 69.5 71.2 71.4  95        62          69.9
    ## 60 77.9 57.5 59.4 64.9 100        56          66.3
    ## 61 68.6 74.5 82.9 75.3 100        62          75.4
    ## 62 71.4 85.5 70.6 75.8  74        66          74.0
    ## 63 52.1 49.0  0.0 33.7  57         0          26.8
    ## 64 53.6 57.5 53.5 54.9  52        64          58.9
    plot(TestScore$T1~TestScore$CourseAverage)

    plot(TestScore$T2~TestScore$CourseAverage)

    plot(TestScore$T3~TestScore$CourseAverage)

    plot(TestScore$Tave~TestScore$CourseAverage)

    plot(TestScore$HW~TestScore$CourseAverage)

    plot(TestScore$FinalExam~TestScore$CourseAverage)

    Interpretation:

    Tave; I chose Tave because it should have the tightest scatter about the line of best fit, being most highly related to course average.
  2. Do a complete regression analysis on all 6 pairs. Report the 6 regression line equations and R² values. Which explanatory variable is the best in predicting course average?

    R codes:

    first <- lm(TestScore$T1~TestScore$CourseAverage)
    summary(first)
    ## 
    ## Call:
    ## lm(formula = TestScore$T1 ~ TestScore$CourseAverage)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -26.812  -6.265   1.363   6.973  29.069 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)             14.75242    4.65029   3.172  0.00235 ** 
    ## TestScore$CourseAverage  0.82592    0.06606  12.503  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 11.19 on 62 degrees of freedom
    ## Multiple R-squared:  0.716,  Adjusted R-squared:  0.7114 
    ## F-statistic: 156.3 on 1 and 62 DF,  p-value: < 2.2e-16
    plot(first)

    plot(TestScore$T1~TestScore$CourseAverage)
    abline(first)

    second <- lm(TestScore$T2~TestScore$CourseAverage)
    summary(second)
    ## 
    ## Call:
    ## lm(formula = TestScore$T2 ~ TestScore$CourseAverage)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -36.005  -5.203   1.072   6.752  36.998 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)             16.74942    5.39837   3.103  0.00289 ** 
    ## TestScore$CourseAverage  0.83356    0.07668  10.870 5.38e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 12.99 on 62 degrees of freedom
    ## Multiple R-squared:  0.6559, Adjusted R-squared:  0.6503 
    ## F-statistic: 118.2 on 1 and 62 DF,  p-value: 5.382e-16
    plot(second)

    plot(TestScore$T2~TestScore$CourseAverage)
    abline(second)

    third <- lm(TestScore$T3~TestScore$CourseAverage)
    summary(third)
    ## 
    ## Call:
    ## lm(formula = TestScore$T3 ~ TestScore$CourseAverage)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -24.1153  -6.7844  -0.3785   7.6478  22.6478 
    ## 
    ## Coefficients:
    ##                          Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)             -31.25575    4.13863  -7.552 2.37e-10 ***
    ## TestScore$CourseAverage   1.39474    0.05879  23.724  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 9.962 on 62 degrees of freedom
    ## Multiple R-squared:  0.9008, Adjusted R-squared:  0.8992 
    ## F-statistic: 562.8 on 1 and 62 DF,  p-value: < 2.2e-16
    plot(third)

    plot(TestScore$T3~TestScore$CourseAverage)
    abline(third)

    fourth <- lm(TestScore$Tave~TestScore$CourseAverage)
    summary(fourth)
    ## 
    ## Call:
    ## lm(formula = TestScore$Tave ~ TestScore$CourseAverage)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -11.2883  -2.5402  -0.3903   2.9860   8.3330 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)              0.09153    1.98061   0.046    0.963    
    ## TestScore$CourseAverage  1.01788    0.02813  36.179   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 4.767 on 62 degrees of freedom
    ## Multiple R-squared:  0.9548, Adjusted R-squared:  0.954 
    ## F-statistic:  1309 on 1 and 62 DF,  p-value: < 2.2e-16
    plot(fourth)

    plot(TestScore$Tave~TestScore$CourseAverage)
    abline(fourth)

    fith <- lm(TestScore$HW~TestScore$CourseAverage)
    summary(fith)
    ## 
    ## Call:
    ## lm(formula = TestScore$HW ~ TestScore$CourseAverage)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -56.991  -7.446   3.636   9.004  38.242 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)              18.1493     7.4886   2.424   0.0183 *  
    ## TestScore$CourseAverage   0.8466     0.1064   7.958 4.68e-11 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 18.03 on 62 degrees of freedom
    ## Multiple R-squared:  0.5053, Adjusted R-squared:  0.4973 
    ## F-statistic: 63.33 on 1 and 62 DF,  p-value: 4.675e-11
    plot(fith)

    plot(TestScore$HW~TestScore$CourseAverage)
    abline(fith)

    sixth <- lm(TestScore$FinalExam~TestScore$CourseAverage)
    summary(sixth)
    ## 
    ## Call:
    ## lm(formula = TestScore$FinalExam ~ TestScore$CourseAverage)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -22.1843  -3.9705   0.6397   5.6142  21.3306 
    ## 
    ## Coefficients:
    ##                          Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)             -23.40575    3.79179  -6.173 5.67e-08 ***
    ## TestScore$CourseAverage   1.18416    0.05386  21.985  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 9.127 on 62 degrees of freedom
    ## Multiple R-squared:  0.8863, Adjusted R-squared:  0.8845 
    ## F-statistic: 483.3 on 1 and 62 DF,  p-value: < 2.2e-16
    plot(sixth)

    plot(TestScore$FinalExam~TestScore$CourseAverage)
    abline(sixth)

    Interpretation:

    Tave, because it has the highest R² (0.9548), so its fitted line is tightest to the data.
  3. Notice how some final exam scores are zeros. These students didn’t show up on the final exam. Do you think those zeros can affect the outcome of regression analysis? Explain.

    R codes:

    Interpretation:

    Yes; the zeros are outliers that pull the fitted line toward them and distort the estimated relationship between final exam and course average.
  4. Now remove the final exam – course average pairs for the students who have zeros on the final exam. To do that, you can use:

    TestScore2<-subset(TestScore, FinalExam!=0)

    Then rerun the regression. Did it change the outcome of the regression analysis? Did it make it stronger or weaker? How do you know?

    R codes:

    sixth <- lm(TestScore2$FinalExam~TestScore2$CourseAverage)
    sixth
    ## 
    ## Call:
    ## lm(formula = TestScore2$FinalExam ~ TestScore2$CourseAverage)
    ## 
    ## Coefficients:
    ##              (Intercept)  TestScore2$CourseAverage  
    ##                   -9.125                     1.004
    summary(sixth)
    ## 
    ## Call:
    ## lm(formula = TestScore2$FinalExam ~ TestScore2$CourseAverage)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -20.8048  -3.0101  -0.0582   3.4770  19.8152 
    ## 
    ## Coefficients:
    ##                          Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)              -9.12454    6.27607  -1.454    0.152    
    ## TestScore2$CourseAverage  1.00436    0.08348  12.031   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 7.943 on 53 degrees of freedom
    ## Multiple R-squared:  0.732,  Adjusted R-squared:  0.7269 
    ## F-statistic: 144.7 on 1 and 53 DF,  p-value: < 2.2e-16
    plot(sixth)

    plot(TestScore2$FinalExam~TestScore2$CourseAverage)
    abline(sixth)

    Interpretation:

    It changed the outcome, making the relationship weaker: R² dropped from about 0.89 to 0.73.
  5. Out of all regression analyses you made, choose the best predictor. Do residuals satisfy necessary conditions? Is the regression analysis valid? Can it be used for estimation and prediction?

    Interpretation:

    Tave is the best predictor; its regression analysis is the most valid and can be used for estimation and prediction.

End of Assignment 2
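As an aside, the six regressions in Assignment 2 can be fitted in one pass with a short loop, which avoids copy-paste mistakes; a sketch assuming TestScore has been read in as above:

```r
# fit <variable> ~ CourseAverage for each of the six variables
vars <- c("T1", "T2", "T3", "Tave", "HW", "FinalExam")
fits <- lapply(vars, function(v)
  lm(reformulate("CourseAverage", response = v), data = TestScore))
names(fits) <- vars
sapply(fits, function(m) summary(m)$r.squared)   # compare the R^2 values side by side
```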

Part III: Special Cases

A statistician named Frank Anscombe made up this data to illustrate a point about regression.

Assignment 3

Read the data from data file “FA.csv”, store it in an object called “FA”

Run the regression analysis for Y1 on X, Y2 on X, Y3 on X, and Y4 on X4. Compare the four sets of regression output. What do they all have in common? List all of the commonalities.

If you only looked at the regression output, would you think these data sets were alike?

Produce four scatter plots for the same four pairs of columns; put them all on the same graph but in different panels. What was Anscombe trying to illustrate with this example?

R codes:

FA <- read.csv("FA.csv")
FA
##     X    Y1   Y2    Y3 X4    Y4
## 1  10  8.04 9.14  7.46  8  6.58
## 2   8  6.95 8.14  6.77  8  5.76
## 3  13  7.58 8.74 12.74  8  7.71
## 4   9  8.81 8.77  7.11  8  8.84
## 5  11  8.33 9.26  7.81  8  8.47
## 6  14  9.96 8.10  8.84  8  7.04
## 7   6  7.24 6.13  6.08  8  5.25
## 8   4  4.26 3.10  5.39 19 12.50
## 9  12 10.84 9.13  8.15  8  5.56
## 10  7  4.82 7.26  6.42  8  7.91
## 11  5  5.68 4.74  5.73  8  6.89
## 12 NA    NA   NA    NA NA    NA
## 13 NA    NA   NA    NA NA    NA
## 14 NA    NA   NA    NA NA    NA
## 15 NA    NA   NA    NA NA    NA
## 16 NA    NA   NA    NA NA    NA
## 17 NA    NA   NA    NA NA    NA
## 18 NA    NA   NA    NA NA    NA
## 19 NA    NA   NA    NA NA    NA
## 20 NA    NA   NA    NA NA    NA
sixth <- lm(FA$Y1~FA$X)
summary(sixth)
## 
## Call:
## lm(formula = FA$Y1 ~ FA$X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## FA$X          0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
plot(sixth)

plot(FA$Y1~FA$X)
abline(sixth)

bruh <- lm(FA$Y2~FA$X)
summary(bruh)
## 
## Call:
## lm(formula = FA$Y2 ~ FA$X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## FA$X           0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
plot(bruh)

plot(FA$Y2~FA$X)
abline(bruh)

thr <- lm(FA$Y3~FA$X)
summary(thr)
## 
## Call:
## lm(formula = FA$Y3 ~ FA$X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## FA$X          0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
plot(thr)

plot(FA$Y3~FA$X)
abline(thr)

fr <- lm(FA$Y4~FA$X4)
summary(fr)
## 
## Call:
## lm(formula = FA$Y4 ~ FA$X4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## FA$X4         0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
plot(fr)

plot(FA$Y4~FA$X4)
abline(fr)

par(mfrow = c(2, 2))

plot(FA$Y1~FA$X)
plot(FA$Y2~FA$X)
plot(FA$Y3~FA$X)
plot(FA$Y4~FA$X4)

Interpretation:

They all have essentially identical regression output: intercept ≈ 3.0, slope ≈ 0.5, and R² ≈ 0.67. If I only looked at the regression output, I would think the data sets were practically the same. Anscombe wanted to show that data sets with very different shapes can produce nearly identical regression summaries, which is why plotting the data first matters.
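Incidentally, R ships Anscombe’s quartet as the built-in data set anscombe (columns x1–x4 and y1–y4), so the same comparison can be reproduced without the CSV file:

```r
# built-in copy of Anscombe's quartet (from the datasets package)
summary(lm(y1 ~ x1, data = anscombe))$r.squared
summary(lm(y4 ~ x4, data = anscombe))$r.squared   # nearly identical R^2
```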

End of Assignment 3